Trying out the official dbt getting-started guide 'Quickstart for dbt Core from a manual install' #dbt

しんや

2023.09.03

I recently published two entries in my dbt Quickstart series, but both of them covered dbt Cloud.

dbt has one more major product pillar: dbt Core. It is a free, open-source, command-line product, and you run all dbt operations through its CLI. In this entry I walk through 'Quickstart for dbt Core from a manual install', the quickstart tutorial provided for dbt Core.

01. Introduction

When you work with dbt through dbt Core, you mainly edit files locally in a code editor and drive the project with the dbt command-line interface (dbt CLI).

Installing dbt Core locally

I currently use macOS, so I'll set up the dbt CLI on this local machine. Several installation methods are available; this time I chose to install dbt with pip.

Checking the OS:

% sw_vers
ProductName:		macOS
ProductVersion:		13.5.1
BuildVersion:		22G90

Checking the Python environment (I installed Python 3.9 using pyenv):

% python --version
Python 3.9.18
% pyenv --version
pyenv 2.3.25
% pip --version
pip 23.2.1 from /Users/xxxxxxxxx/.pyenv/versions/3.9.18/lib/python3.9/site-packages/pip (python 3.9)

% pip install dbt-core
% pip install dbt-redshift
% pip install dbt-snowflake
% pip install dbt-bigquery

% dbt --version
Core:
  - installed: 1.6.1
  - latest:    1.6.1 - Up to date!

Plugins:
  - bigquery:  1.6.4 - Up to date!
  - snowflake: 1.6.2 - Up to date!
  - redshift:  1.6.1 - Up to date!
  - postgres:  1.6.1 - Up to date!

Preparing a BigQuery environment for connection testing

I'll use BigQuery as the target data warehouse here, reusing the environment I set up when writing an earlier entry.

Preparing a Git repository for connection testing

I also made sure in advance that I had a GitHub account available.

02. Creating a starter project

With BigQuery set up to work with dbt, we are ready to create a starter project that contains sample models before building our own.

Creating the Git repository

Follow the steps below to create the Git repository for the project. This walkthrough uses GitHub as the Git provider, but any Git provider that dbt supports will work.

(List of available repositories):

Log in to GitHub and create a new repository.

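Give the repository any name you like. Following the documentation, I named mine dbt-core-tutorial-bigquery. The documentation says it is fine to create it as a public repository because you can change that later, but I saw no reason to leave it public even temporarily, so I made it Private from the start. I left everything else at the defaults and clicked [Create Repository].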

The repository has been created. We'll need it later, so keep a copy of the commands shown under '…or create a new repository on the command line'.

Creating the dbt project

From here on we create the dbt project from the command line. I prepared a parent folder (blog-verification) on my local machine beforehand, and the dbt project will be created under it.

% pwd
/Users/xxxx.xxxxxxxxx/Desktop/blog-verification

% dbt --version
Core:
  - installed: 1.6.1
  - latest:    1.6.1 - Up to date!

Plugins:
  - bigquery:  1.6.4 - Up to date!
  - snowflake: 1.6.2 - Up to date!
  - redshift:  1.6.1 - Up to date!
  - postgres:  1.6.1 - Up to date!

Run dbt init jaffle_shop to create the project. You will be prompted for several parameters, so let's go through them in order.

% dbt init jaffle_shop
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  [ConfigFolderDirectory]: Unable to parse dict {'dir': PosixPath('/Users/xxxx.xxxxxxxxx/.dbt')}
xx:xx:xx  Creating dbt configuration folder at 
xx:xx:xx  
Your new dbt project "jaffle_shop" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

  https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

  https://community.getdbt.com/

Happy modeling!

First, choose the type of data warehouse the project will connect to. We're using BigQuery, so enter '1' and press Enter to continue.

xx:xx:xx  Setting up your profile.
Which database would you like to use?
[1] bigquery
[2] snowflake
[3] redshift
[4] postgres

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1

You are then asked for several more settings, including the authentication method. Enter the corresponding number or string for each.

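Authentication method: I chose the service account key option ('2').
Keyfile: the path to the credentials file. I specified the location of the service account key created in the entry below.
- Trying out the official dbt getting-started guide 'Quickstart for dbt Cloud and BigQuery' #dbt | DevelopersIO
Project ID: the ID of the GCP project that holds the BigQuery datasets.
Dataset: the name of the dataset to create for this project. In dbt Cloud this was generated automatically from the user name; here I entered a name of my own.
Threads: the prompt says '1 or more', so I simply entered 1.
Job execution timeout: likewise, I kept the suggested value of 300.
Dataset location: I would have preferred Asia (Tokyo), but there were only two choices, so I selected US.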

Keep answering the prompts and pressing Enter, and the configuration file is generated at the end.

[1] oauth
[2] service_account
Desired authentication method option (enter a number): 2
keyfile (/path/to/bigquery/keyfile.json): /Users/xxxx.xxxxxxxxx/Desktop/xxxx/xxxx/xxxxxx/dbt-user-shinyaa31-creds.json
project (GCP project id): xxxxxxxxxxxxxxxx
dataset (the name of your dbt dataset): dbtcore_shinyaa31_dev
threads (1 or more): 1
job_execution_timeout_seconds [300]: 300
[1] US
[2] EU
Desired location option (enter a number): 1
xx:xx:xx  Profile jaffle_shop written to /Users/xxxx.xxxxxxxxx/.dbt/profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.

The message says we can validate the setup with the dbt debug command, so let's try it now. It ran a series of checks and finished with 'All checks passed!'.

% dbt debug
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  dbt version: 1.6.1
xx:xx:xx  python version: 3.9.18
xx:xx:xx  python path: /Users/xxxx.xxxxxxxxx/.pyenv/versions/3.9.18/bin/python3.9
xx:xx:xx  os info: macOS-13.5.1-x86_64-i386-64bit
xx:xx:xx  Using profiles dir at /Users/xxxx.xxxxxxxxx/.dbt
xx:xx:xx  Using profiles.yml file at /Users/xxxx.xxxxxxxxx/.dbt/profiles.yml
xx:xx:xx  Using dbt_project.yml file at /Users/xxxx.xxxxxxxxx/xxxxxxxxxx/blog-verification/jaffle_shop/dbt_project.yml
xx:xx:xx  adapter type: bigquery
xx:xx:xx  adapter version: 1.6.4
xx:xx:xx  Configuration:
xx:xx:xx    profiles.yml file [OK found and valid]
xx:xx:xx    dbt_project.yml file [OK found and valid]
xx:xx:xx  Required dependencies:
xx:xx:xx   - git [OK found]

xx:xx:xx  Connection:
xx:xx:xx    method: service-account
xx:xx:xx    database: xxxxxxxxxxxxxxxx
xx:xx:xx    execution_project: xxxxxxxxxxxxxxxx
xx:xx:xx    schema: dbtcore_shinyaa31_dev
xx:xx:xx    location: US
xx:xx:xx    priority: interactive
xx:xx:xx    maximum_bytes_billed: None
xx:xx:xx    impersonate_service_account: None
xx:xx:xx    job_retry_deadline_seconds: None
xx:xx:xx    job_retries: 1
xx:xx:xx    job_creation_timeout_seconds: None
xx:xx:xx    job_execution_timeout_seconds: 300
xx:xx:xx    keyfile: /Users/xxxx.xxxxxxxxx/xxxxxxxxxx/xxxxxxxx/dbt-user-shinyaa31-creds.json
xx:xx:xx    timeout_seconds: 300
xx:xx:xx    refresh_token: None
xx:xx:xx    client_id: None
xx:xx:xx    token_uri: None
xx:xx:xx    dataproc_region: None
xx:xx:xx    dataproc_cluster_name: None
xx:xx:xx    gcs_bucket: None
xx:xx:xx    dataproc_batch: None
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx   Connection test: [OK connection ok]

xx:xx:xx  All checks passed!

I also checked the generated files from the OS side.

% pwd
/Users/xxxx.xxxxxxxxx/Desktop/blog-verification
% ls
jaffle_shop	logs
% cd jaffle_shop 
% pwd
/Users/xxxx.xxxxxxxxx/Desktop/blog-verification/jaffle_shop
% ls -lta
total 24
-rw-r--r--   1 xxxx.xxxxxxxxx  staff  1265  9  3 xx:xx dbt_project.yml
drwxr-xr-x   4 xxxx.xxxxxxxxx  staff   128  9  3 xx:xx ..
drwxr-xr-x  11 xxxx.xxxxxxxxx  staff   352  9  1 xx:xx .
drwxr-xr-x   3 xxxx.xxxxxxxxx  staff    96  9  1 xx:xx tests
drwxr-xr-x   3 xxxx.xxxxxxxxx  staff    96  9  1 xx:xx snapshots
drwxr-xr-x   3 xxxx.xxxxxxxxx  staff    96  9  1 xx:xx seeds
drwxr-xr-x   3 xxxx.xxxxxxxxx  staff    96  9  1 xx:xx models
drwxr-xr-x   3 xxxx.xxxxxxxxx  staff    96  9  1 xx:xx macros
drwxr-xr-x   3 xxxx.xxxxxxxxx  staff    96  9  1 xx:xx analyses
-rw-r--r--   1 xxxx.xxxxxxxxx  staff   571  9  1 xx:xx README.md
-rw-r--r--   1 xxxx.xxxxxxxxx  staff    29  9  1 xx:xx .gitignore

Open the project files in any editor to look at their contents. I used Visual Studio Code: after launching the app, I chose [Open Folder] and selected the jaffle_shop folder created in the steps above.

I was able to browse the dbt project from Visual Studio Code.

The tutorial says to change each of the following entries to 'jaffle_shop', but when I followed the steps above they were already set to 'jaffle_shop'.

name: jaffle_shop # Change from the default, `my_new_project`

...

profile: jaffle_shop # Change from the default profile name, `default`

...

models:
  jaffle_shop: # Change from `my_new_project` to match the previous value for `name:`
    ...

Connecting to BigQuery (reviewing the settings)

When you use dbt Core, that is, when you work with dbt locally, dbt connects to the data warehouse through a mechanism called a profile. The profile is a YAML file that holds the detailed connection settings.

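At this point the tutorial I followed has you create a new data warehouse connection and verify it with dbt debug. As we saw above, though, it appears the flow has been updated so that you specify these settings as part of creating the dbt project.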

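Let's look at the contents of ~/.dbt/profiles.yml. Sure enough, the values I entered earlier are recorded there as YAML. The dbt debug run above used this information to connect to the database, and the connection succeeded (see the 'Connection test: [OK connection ok]' line in its output).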

% cat ~/.dbt/profiles.yml 
jaffle_shop:
  outputs:
    dev:
      dataset: dbtcore_shinyaa31_dev
      job_execution_timeout_seconds: 300
      job_retries: 1
      keyfile: /Users/xxxx.xxxxxxxxx/Desktop/xxxxxxxxxx/xxxxxxxxx/dbt-user-shinyaa31-creds.json
      location: US
      method: service-account
      priority: interactive
      project: xxxxxxxxx
      threads: 1
      type: bigquery
  target: dev
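
To make the profile mechanism a little more concrete, here is a minimal Python sketch of how a project ends up with its connection settings: the profile name in dbt_project.yml selects an entry in profiles.yml, and that entry's target selects one of its outputs. This is only an illustration under my own assumptions, not dbt's actual code. The dictionaries are hand-copied (and trimmed) from the YAML above, and resolve_connection is a hypothetical helper name.

```python
from typing import Any, Dict, Optional

# Hand-copied, trimmed version of ~/.dbt/profiles.yml shown above.
PROFILES: Dict[str, Any] = {
    "jaffle_shop": {
        "outputs": {
            "dev": {
                "type": "bigquery",
                "method": "service-account",
                "dataset": "dbtcore_shinyaa31_dev",
                "location": "US",
                "threads": 1,
            }
        },
        "target": "dev",
    }
}

# The two relevant keys from dbt_project.yml.
PROJECT: Dict[str, str] = {"name": "jaffle_shop", "profile": "jaffle_shop"}


def resolve_connection(
    project: Dict[str, str],
    profiles: Dict[str, Any],
    target: Optional[str] = None,
) -> Dict[str, Any]:
    """Pick the connection settings a project would use (hypothetical helper)."""
    profile = profiles[project["profile"]]      # profile: in dbt_project.yml
    target_name = target or profile["target"]   # default target: in profiles.yml
    return profile["outputs"][target_name]


connection = resolve_connection(PROJECT, PROFILES)
print(connection["type"], connection["dataset"])
```

Passing a different target name here plays the role of dbt's --target flag, which picks another entry under outputs when one is defined.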

Running dbt run for the first time with the dbt CLI

Let's run dbt run for the first time. The result is, admittedly, no different from what we saw with dbt Cloud.

% pwd 
/Users/xxxx.xxxxxxxxx/Desktop/blog-verification/jaffle_shop
% dbt run
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Unable to do partial parsing because saved manifest not found. Starting full parse.
xx:xx:xx  Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 2 START sql table model dbtcore_shinyaa31_dev.my_first_dbt_model .......... [RUN]
xx:xx:xx  1 of 2 OK created sql table model dbtcore_shinyaa31_dev.my_first_dbt_model ..... [CREATE TABLE (2.0 rows, 0 processed) in 7.91s]
xx:xx:xx  2 of 2 START sql view model dbtcore_shinyaa31_dev.my_second_dbt_model .......... [RUN]
xx:xx:xx  2 of 2 OK created sql view model dbtcore_shinyaa31_dev.my_second_dbt_model ..... [CREATE VIEW (0 processed) in 4.42s]
xx:xx:xx  
xx:xx:xx  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 23.01 seconds (23.01s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

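Committing the work so far to the repository

Now push the work so far to the repository by running the git commands saved when the repository was created. (Note: make sure beforehand that your Git setup can authenticate against the remote.)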

% echo "# dbt-core-tutorial-bigquery" >> README.md
% git init
Initialized empty Git repository in /Users/xxxx.xxxxxxxxx/Desktop/blog-verification/jaffle_shop/.git/
% git add README.md
% git commit -m "first commit"
[main (root-commit) xxxxxxxx] first commit
 1 file changed, 16 insertions(+)
 create mode 100644 README.md
% git branch -M main
% git remote add origin https://github.com/xxxx.xxxxxxxxx/dbt-core-tutorial-bigquery.git
% git push -u origin main
Username for 'https://github.com': xxxx.xxxxxxxxx
Password for 'https://xxxx.xxxxxxxxx@github.com': 
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 16 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 543 bytes | 543.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/xxxx.xxxxxxxxx/dbt-core-tutorial-bigquery.git
 * [new branch]      main -> main
branch 'main' set up to track 'origin/main'.

Confirm that the files have been registered in the repository. (On closer inspection, the git commands GitHub told us to keep only add README.md, so I made an additional commit covering the rest of the files under the project and pushed that as well.)

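03. Building your first model with dbt Core

Running your first model

With the sample project set up, it's time to start building models! We'll take a sample query and turn it into a model in the dbt project.

Create a new branch by running the checkout command with the -b flag.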

% pwd
/Users/xxxx.xxxxxxxxx/Desktop/blog-verification/jaffle_shop
% git checkout -b add-customers-model
Switched to a new branch 'add-customers-model'

Create a customers.sql file under the models directory.

with customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from `dbt-tutorial`.jaffle_shop.customers

),

orders as (

    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status

    from `dbt-tutorial`.jaffle_shop.orders

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders

    from orders

    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from customers

    left join customer_orders using (customer_id)

)

select * from final
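
To see what this query computes without touching BigQuery, here is a small sketch that runs the same logic against an in-memory SQLite database. Everything about the data is an assumption for illustration: the raw_customers and raw_orders tables and their rows are made up, and SQLite merely stands in for BigQuery, so the backtick-quoted `dbt-tutorial` references are replaced with local table names.

```python
import sqlite3

# Toy data invented for illustration; the tutorial itself reads from
# `dbt-tutorial`.jaffle_shop in BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table raw_customers (id integer, first_name text, last_name text);
    create table raw_orders (id integer, user_id integer, order_date text, status text);
    insert into raw_customers values
        (1, 'Michael', 'P.'), (2, 'Shawn', 'M.'), (3, 'Kathleen', 'P.');
    insert into raw_orders values
        (10, 1, '2018-01-01', 'returned'),
        (11, 1, '2018-02-10', 'completed'),
        (12, 2, '2018-01-11', 'completed');
""")

# Same CTE structure as customers.sql, pointed at the toy tables.
CUSTOMERS_SQL = """
with customers as (
    select id as customer_id, first_name, last_name
    from raw_customers
),
orders as (
    select id as order_id, user_id as customer_id, order_date, status
    from raw_orders
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)
select * from final order by customer_id
"""

rows = conn.execute(CUSTOMERS_SQL).fetchall()
for row in rows:
    print(row)
```

The customer with no orders still appears in the result thanks to the left join, with empty order dates and a number_of_orders of 0 from the coalesce.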

Run dbt run. The model we just created is built along with the others.

% dbt run
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 3 START sql view model dbtcore_shinyaa31_dev.customers .................... [RUN]
xx:xx:xx  1 of 3 OK created sql view model dbtcore_shinyaa31_dev.customers ............... [CREATE VIEW (0 processed) in 4.66s]
xx:xx:xx  2 of 3 START sql table model dbtcore_shinyaa31_dev.my_first_dbt_model .......... [RUN]
xx:xx:xx  2 of 3 OK created sql table model dbtcore_shinyaa31_dev.my_first_dbt_model ..... [CREATE TABLE (2.0 rows, 0 processed) in 7.67s]
xx:xx:xx  3 of 3 START sql view model dbtcore_shinyaa31_dev.my_second_dbt_model .......... [RUN]
xx:xx:xx  3 of 3 OK created sql view model dbtcore_shinyaa31_dev.my_second_dbt_model ..... [CREATE VIEW (0 processed) in 4.37s]
xx:xx:xx  
xx:xx:xx  Finished running 2 view models, 1 table model in 0 hours 0 minutes and 23.02 seconds (23.02s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

I also checked in the BigQuery console and confirmed that customers was created as a view, as determined by the configuration described in the next section.

Changing how a model is materialized

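This part shows that dbt lets you switch how a model is materialized by changing a value in a configuration file. As in the BigQuery and Snowflake editions of the dbt Cloud quickstart, I followed these two steps:

Change the settings in dbt_project.yml
Run dbt run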

Updated configuration file:

name: 'jaffle_shop'
:
:
models:
  jaffle_shop:
    +materialized: table
    example:
      +materialized: view

Run dbt run. The log shows the target model (customers) being recreated as a table, in line with the configuration change.

% dbt run --full-refresh
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 3 START sql table model dbtcore_shinyaa31_dev.customers ................... [RUN]
xx:xx:xx  1 of 3 OK created sql table model dbtcore_shinyaa31_dev.customers .............. [CREATE TABLE (100.0 rows, 4.3 KiB processed) in 7.49s]
xx:xx:xx  2 of 3 START sql table model dbtcore_shinyaa31_dev.my_first_dbt_model .......... [RUN]
xx:xx:xx  2 of 3 OK created sql table model dbtcore_shinyaa31_dev.my_first_dbt_model ..... [CREATE TABLE (2.0 rows, 0 processed) in 6.55s]
xx:xx:xx  3 of 3 START sql view model dbtcore_shinyaa31_dev.my_second_dbt_model .......... [RUN]
xx:xx:xx  3 of 3 OK created sql view model dbtcore_shinyaa31_dev.my_second_dbt_model ..... [CREATE VIEW (0 processed) in 4.23s]
xx:xx:xx  
xx:xx:xx  Finished running 2 table models, 1 view model in 0 hours 0 minutes and 24.11 seconds (24.11s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

In the BigQuery console, too, I confirmed that what had been a view was recreated as a table after running the dbt command.

You can also set the materialization in an individual model file, as shown below, which overrides what is specified in dbt_project.yml.

To build the model as a view:

{{ config(materialized='view') }}

with customers as (
:
:

To build the model as a table:

{{ config(materialized='table') }}

with customers as (
:
:
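
The precedence between these settings can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not dbt's implementation: resolve_materialization is a hypothetical helper, the dictionary is hand-copied from the models: block of dbt_project.yml, and the only rules modeled are that a more specific folder overrides a less specific one and that a config() block in the model file overrides both. As far as I know, the starter project's my_first_dbt_model.sql carries its own config(materialized='table'), which would be consistent with it being built as a table in the run logs even though example is set to view.

```python
from typing import Any, Dict, Optional

# Hand-copied from the models: block of dbt_project.yml shown above.
PROJECT_MODELS_CONFIG: Dict[str, Any] = {
    "jaffle_shop": {
        "+materialized": "table",
        "example": {"+materialized": "view"},
    }
}

DEFAULT_MATERIALIZATION = "view"  # used when nothing is configured


def resolve_materialization(
    project: str, model_path: str, in_file: Optional[str] = None
) -> str:
    """Return the materialization for a model (hypothetical helper).

    model_path is relative to models/, e.g. "example/my_first_dbt_model.sql".
    in_file is the value passed to {{ config(materialized=...) }}, if any.
    """
    if in_file is not None:
        return in_file  # a config() block in the model file wins
    node = PROJECT_MODELS_CONFIG.get(project, {})
    resolved = node.get("+materialized", DEFAULT_MATERIALIZATION)
    # Walk down the folder hierarchy; deeper settings override shallower ones.
    for folder in model_path.split("/")[:-1]:
        node = node.get(folder)
        if not isinstance(node, dict):
            break
        resolved = node.get("+materialized", resolved)
    return resolved


print(resolve_materialization("jaffle_shop", "customers.sql"))
print(resolve_materialization("jaffle_shop", "example/my_second_dbt_model.sql"))
print(resolve_materialization("jaffle_shop", "example/my_first_dbt_model.sql", in_file="table"))
```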

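Deleting the sample models

The files under example that came with the starter project have served their purpose. I deleted the whole folder, re-ran dbt run, and confirmed that the deleted models were no longer built.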

% rm -rf models/example 
% dbt run --full-refresh

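Building models on top of other models

This step exercises dependencies between models: we create two new models and change customers.sql to reference them. Following the documentation, I added or changed three files in total.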

% touch models/stg_customers.sql
% vi models/stg_customers.sql 
% touch models/stg_orders.sql
% vi models/stg_orders.sql 
% vi models/customers.sql

Contents of models/stg_customers.sql (new):

select
    id as customer_id,
    first_name,
    last_name

from `dbt-tutorial`.jaffle_shop.customers

Contents of models/stg_orders.sql (new):

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status

from `dbt-tutorial`.jaffle_shop.orders

Contents of models/customers.sql (changed):

with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders

    from orders

    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from customers

    left join customer_orders using (customer_id)

)

select * from final

Run dbt run. You can see that, in line with the dependencies, customers.sql is built last.

% dbt run --full-refresh        
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 3 START sql table model dbtcore_shinyaa31_dev.stg_customers ............... [RUN]
xx:xx:xx  1 of 3 OK created sql table model dbtcore_shinyaa31_dev.stg_customers .......... [CREATE TABLE (100.0 rows, 1.9 KiB processed) in 6.38s]
xx:xx:xx  2 of 3 START sql table model dbtcore_shinyaa31_dev.stg_orders .................. [RUN]
xx:xx:xx  2 of 3 OK created sql table model dbtcore_shinyaa31_dev.stg_orders ............. [CREATE TABLE (99.0 rows, 3.3 KiB processed) in 6.68s]
xx:xx:xx  3 of 3 START sql table model dbtcore_shinyaa31_dev.customers ................... [RUN]
xx:xx:xx  3 of 3 OK created sql table model dbtcore_shinyaa31_dev.customers .............. [CREATE TABLE (100.0 rows, 4.3 KiB processed) in 7.16s]
xx:xx:xx  
xx:xx:xx  Finished running 3 table models in 0 hours 0 minutes and 26.99 seconds (26.99s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
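
The ordering in this log follows from the ref() calls. As a rough sketch of the idea (not how dbt actually parses projects, since it uses Jinja rather than a regular expression), the following Python extracts ref() targets from abbreviated model SQL and sorts the models topologically. build_order and REF_PATTERN are names I made up for this example, and the SQL strings are shortened versions of the files above.

```python
import re
from graphlib import TopologicalSorter
from typing import Dict, List

# Abbreviated versions of the three model files shown above.
MODELS: Dict[str, str] = {
    "customers": """
        with customers as (select * from {{ ref('stg_customers') }}),
             orders as (select * from {{ ref('stg_orders') }})
        select * from customers
    """,
    "stg_customers": "select id as customer_id from `dbt-tutorial`.jaffle_shop.customers",
    "stg_orders": "select id as order_id from `dbt-tutorial`.jaffle_shop.orders",
}

# Matches {{ ref('model_name') }} with either quote style.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_order(models: Dict[str, str]) -> List[str]:
    """Return model names so that every model comes after the models it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

Whatever order the two staging models come out in, customers always lands last because it depends on both of them.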

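Before moving on to the next step

Before moving on, the documentation touches on a few additional topics. I'll briefly cover them here too.

SQL errors

What happens when the SQL you wrote contains an error? To find out, I simply specified a column name that does not exist, as shown below.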

select
    id as customer_id,
    first_namex,
    last_name

from `dbt-tutorial`.jaffle_shop.customers

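Run dbt run. As expected, an error occurred at that model, and the model that references it was skipped. (Unrelated models still ran as usual.)

The SQL error is also written to the log, so this information should be enough to track down the problem.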

xx:xx:xx  1 of 3 START sql table model dbtcore_shinyaa31_dev.stg_customers ............... [RUN]
xx:xx:xx  BigQuery adapter: https://console.cloud.google.com/bigquery?project=xxxx.xxxxxxxxx&j=bq:US:xxxxxxxxxxxxxxxxxxx&page=queryresults
xx:xx:xx  1 of 3 ERROR creating sql table model dbtcore_shinyaa31_dev.stg_customers ...... [ERROR in 8.60s]
xx:xx:xx  2 of 3 START sql table model dbtcore_shinyaa31_dev.stg_orders .................. [RUN]
xx:xx:xx  2 of 3 OK created sql table model dbtcore_shinyaa31_dev.stg_orders ............. [CREATE TABLE (99.0 rows, 3.3 KiB processed) in 7.28s]
xx:xx:xx  3 of 3 SKIP relation dbtcore_shinyaa31_dev.customers ........................... [SKIP]
xx:xx:xx  
xx:xx:xx  Finished running 3 table models in 0 hours 0 minutes and 21.95 seconds (21.95s).
xx:xx:xx  
xx:xx:xx  Completed with 1 error and 0 warnings:
xx:xx:xx  
xx:xx:xx  Database Error in model stg_customers (models/stg_customers.sql)
xx:xx:xx    Unrecognized name: first_namex at [15:5]
xx:xx:xx   compiled Code at target/run/jaffle_shop/models/stg_customers.sql
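
The ERROR / OK / SKIP pattern in this log can be mimicked with a small simulation. This is a sketch of the observed behavior only, under the assumption that a model is skipped whenever any model it depends on did not succeed; simulate_run and the DEPENDENCIES dictionary are hypothetical and no SQL is executed.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# Which models each model refs, as in the project above.
DEPENDENCIES: Dict[str, Set[str]] = {
    "stg_customers": set(),
    "stg_orders": set(),
    "customers": {"stg_customers", "stg_orders"},
}


def simulate_run(dependencies: Dict[str, Set[str]], failing: Set[str]) -> Dict[str, str]:
    """Walk the models in dependency order and assign OK, ERROR, or SKIP."""
    statuses: Dict[str, str] = {}
    for model in TopologicalSorter(dependencies).static_order():
        if any(statuses[parent] != "OK" for parent in dependencies[model]):
            statuses[model] = "SKIP"   # an upstream model did not succeed
        elif model in failing:
            statuses[model] = "ERROR"  # this model's own SQL failed
        else:
            statuses[model] = "OK"
    return statuses


print(simulate_run(DEPENDENCIES, failing={"stg_customers"}))
```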

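Running a single model

The --select option lets you run a single model. It apparently supports various other selection and exclusion criteria as well; I'd like to dig into those in a separate entry.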

% dbt run --select stg_customers 
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 1 START sql table model dbtcore_shinyaa31_dev.stg_customers ............... [RUN]
xx:xx:xx  1 of 1 OK created sql table model dbtcore_shinyaa31_dev.stg_customers .......... [CREATE TABLE (100.0 rows, 1.9 KiB processed) in 7.95s]
xx:xx:xx  
xx:xx:xx  Finished running 1 table model in 0 hours 0 minutes and 14.32 seconds (14.32s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

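Organizing the folder structure

This topic is about refactoring the project layout: the files created in the tutorial can be grouped by purpose, so create a folder and move them there. After reorganizing, I ran dbt run and confirmed that everything was still processed without problems.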

% mkdir models/staging
% mv models/stg*.sql models/staging 
% ls -lta models 
total 8
drwxr-xr-x  4 xxxx.xxxxxxxx  staff  128  9  3 xx:xx staging
-rw-r--r--@ 1 xxxx.xxxxxxxx staff  737  9  3 xx:xx customers.sql
% ls -lta models/staging 
total 16
-rw-r--r--  1 xxxx.xxxxxxxx  staff  104  9  3 xx:xx stg_customers.sql
-rw-r--r--  1 xxxx.xxxxxxxx  staff  123  9  3 xx:xx stg_orders.sql
% dbt run --full-refresh

Compiled SQL files live under the `target` folder

Here you can see the SQL that dbt actually executes.

% ls  target/compiled/jaffle_shop/models
customers.sql		example			staging			stg_customers.sql	stg_orders.sql
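
To give a feel for what a compiled file contains, here is a deliberately simplified Python sketch that replaces each ref() call with a fully qualified, backtick-quoted BigQuery relation name. Real dbt renders the whole file with Jinja, so treat this as an approximation of the result rather than the mechanism. compile_refs is a hypothetical helper and my-gcp-project is a placeholder project ID.

```python
import re

# Matches {{ ref('model_name') }} with either quote style.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def compile_refs(sql: str, project: str, dataset: str) -> str:
    """Replace each {{ ref('model') }} with `project`.`dataset`.`model`."""

    def to_relation(match: re.Match) -> str:
        return f"`{project}`.`{dataset}`.`{match.group(1)}`"

    return REF_PATTERN.sub(to_relation, sql)


source_sql = "select * from {{ ref('stg_customers') }}"
compiled = compile_refs(
    source_sql,
    project="my-gcp-project",         # placeholder, not the real project ID
    dataset="dbtcore_shinyaa31_dev",  # dataset name used in this walkthrough
)
print(compiled)
```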

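About dbt's execution logs

The log files record how dbt Core is operating within the project. They show the select statements being executed as well as the Python logging emitted while dbt runs. This log looks like the place to start when investigating a problem.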

% ls logs
dbt.log
% tail -f logs/dbt.log

04. Adding tests and documentation to the project

Running tests on the project

Let's run tests against the models we built. Create a new file, models/schema.yml.

version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Run dbt test. I confirmed that all tests passed.

% dbt test
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 9 tests, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 9 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned  [RUN]
xx:xx:xx  1 of 9 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned  [PASS in 5.46s]
xx:xx:xx  2 of 9 START test not_null_customers_customer_id ............................... [RUN]
xx:xx:xx  2 of 9 PASS not_null_customers_customer_id ..................................... [PASS in 5.44s]
xx:xx:xx  3 of 9 START test not_null_stg_customers_customer_id ........................... [RUN]
xx:xx:xx  3 of 9 PASS not_null_stg_customers_customer_id ................................. [PASS in 4.93s]
xx:xx:xx  4 of 9 START test not_null_stg_orders_customer_id .............................. [RUN]
xx:xx:xx  4 of 9 PASS not_null_stg_orders_customer_id .................................... [PASS in 5.57s]
xx:xx:xx  5 of 9 START test not_null_stg_orders_order_id ................................. [RUN]
xx:xx:xx  5 of 9 PASS not_null_stg_orders_order_id ....................................... [PASS in 5.55s]
xx:xx:xx  6 of 9 START test relationships_stg_orders_customer_id__customer_id__ref_stg_customers_  [RUN]
xx:xx:xx  6 of 9 PASS relationships_stg_orders_customer_id__customer_id__ref_stg_customers_  [PASS in 5.62s]
xx:xx:xx  7 of 9 START test unique_customers_customer_id ................................. [RUN]
xx:xx:xx  7 of 9 PASS unique_customers_customer_id ....................................... [PASS in 5.12s]
xx:xx:xx  8 of 9 START test unique_stg_customers_customer_id ............................. [RUN]
xx:xx:xx  8 of 9 PASS unique_stg_customers_customer_id ................................... [PASS in 5.09s]
xx:xx:xx  9 of 9 START test unique_stg_orders_order_id ................................... [RUN]
xx:xx:xx  9 of 9 PASS unique_stg_orders_order_id ......................................... [PASS in 4.93s]
xx:xx:xx  
xx:xx:xx  Finished running 9 tests in 0 hours 0 minutes and 51.06 seconds (51.06s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=9 WARN=0 ERROR=0 SKIP=0 TOTAL=9
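
One way to think about the four test types in schema.yml is as queries that return offending rows, where a test passes when its query returns nothing. The sketch below illustrates that idea against an in-memory SQLite database. The queries are my own simplified approximations rather than the SQL dbt generates, the rows are invented, and one deliberately bad order row is included so that some checks fail.

```python
import sqlite3
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table stg_customers (customer_id integer);
    create table stg_orders (order_id integer, customer_id integer, status text);
    insert into stg_customers values (1), (2);
    insert into stg_orders values (10, 1, 'completed');
    insert into stg_orders values (11, 2, 'shipped');
    -- Deliberately bad row: duplicate order_id, unknown customer, unknown status.
    insert into stg_orders values (11, 3, 'lost');
""")

# Each query selects the rows that violate the rule (simplified approximations).
CHECKS: Dict[str, str] = {
    "unique_stg_orders_order_id": """
        select order_id from stg_orders
        where order_id is not null
        group by order_id having count(*) > 1
    """,
    "not_null_stg_orders_order_id": """
        select order_id from stg_orders where order_id is null
    """,
    "accepted_values_stg_orders_status": """
        select status from stg_orders
        where status not in ('placed', 'shipped', 'completed', 'return_pending', 'returned')
    """,
    "relationships_stg_orders_customer_id": """
        select o.customer_id from stg_orders as o
        left join stg_customers as c on o.customer_id = c.customer_id
        where o.customer_id is not null and c.customer_id is null
    """,
}


def run_checks(connection: sqlite3.Connection) -> Dict[str, str]:
    """Run every check; a check passes when its query returns no rows."""
    results: Dict[str, str] = {}
    for name, sql in CHECKS.items():
        failures = connection.execute(sql).fetchall()
        results[name] = "PASS" if not failures else f"FAIL {len(failures)}"
    return results


for name, status in run_checks(conn).items():
    print(name, status)
```

Looking at the rows a failing check returns is also a reasonable first step when debugging a failed test.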

Generating documentation for the project

Next, generate documentation for the project. Update the models/schema.yml file created earlier to the following.

version: 2

models:
  - name: customers
    description: One record per customer
    columns:
      - name: customer_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: first_order_date
        description: NULL when a customer has not yet placed an order.

  - name: stg_customers
    description: This model cleans up customer data
    columns:
      - name: customer_id
        description: Primary key
        tests:
          - unique
          - not_null

  - name: stg_orders
    description: This model cleans up order data
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

Run dbt docs generate and confirm that it completes successfully.

% dbt docs generate
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 7 tests, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx
xx:xx:xx  Building catalog
xx:xx:xx  Catalog written to /Users/xxxx.xxxxxxx/Desktop/blog-verification/jaffle_shop/target/catalog.json

Running dbt docs serve lets you browse the documentation on localhost.

% dbt docs serve
xx:xx:xx  Running with dbt=1.6.1
Serving docs at 8080
To access from your browser, navigate to: http://localhost:8080



Press Ctrl+C to exit.
127.0.0.1 - - [03/Sep/2023 16:35:33] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [03/Sep/2023 16:35:33] "GET /manifest.json?cb=1693726533435 HTTP/1.1" 200 -
127.0.0.1 - - [03/Sep/2023 16:35:33] "GET /catalog.json?cb=1693726533435 HTTP/1.1" 200 -

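Before moving on to the next step

As in the previous step, the documentation touches on a few supplementary topics here.

Handling failing tests

What happens when a test run by dbt test fails, and how can you debug it? I don't yet have a good grasp of the tests themselves, so I'd like to dig into this in a separate entry.

Running tests for a specific model only

As with dbt run, you can run tests against individual models only.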

% dbt test --select stg_orders
xx:xx:xx  Running with dbt=1.6.1
xx:xx:xx  Registered adapter: bigquery=1.6.4
xx:xx:xx  Found 3 models, 7 tests, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
xx:xx:xx  
xx:xx:xx  Concurrency: 1 threads (target='dev')
xx:xx:xx  
xx:xx:xx  1 of 3 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned  [RUN]
xx:xx:xx  1 of 3 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned  [PASS in 5.02s]
xx:xx:xx  2 of 3 START test not_null_stg_orders_order_id ................................. [RUN]
xx:xx:xx  2 of 3 PASS not_null_stg_orders_order_id ....................................... [PASS in 5.06s]
xx:xx:xx  3 of 3 START test unique_stg_orders_order_id ................................... [RUN]
xx:xx:xx  3 of 3 PASS unique_stg_orders_order_id ......................................... [PASS in 4.89s]
xx:xx:xx  
xx:xx:xx  Finished running 3 tests in 0 hours 0 minutes and 18.51 seconds (18.51s).
xx:xx:xx  
xx:xx:xx  Completed successfully
xx:xx:xx  
xx:xx:xx  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

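Adding Markdown descriptions to models with docs blocks

This one also looks fairly substantial, so I'll cover it in a separate entry.

Pushing the changes so far to the repository

Following the documentation, I pushed the changes made so far to the Git repository.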

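05. Scheduling jobs

dbt appears to recommend dbt Cloud for scheduling jobs, so I'll wrap up the tutorial walkthrough here.

dbt also publishes guidance on scheduling jobs with dbt Core. Its stance seems to be that dbt Core should not be used on its own for this, but integrated with some other service.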
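
As a rough idea of what such an integration might call, here is a minimal Python wrapper that an external scheduler (cron, a workflow tool, and so on) could invoke. It is a sketch under my own assumptions, not something taken from the tutorial: run_pipeline and run_command are hypothetical names, the step list simply mirrors the dbt run and dbt test commands used above, and it relies on dbt exiting with a non-zero status when a command fails.

```python
import subprocess
from typing import Callable, List, Sequence

# The commands used in this walkthrough, in the order a job would run them.
STEPS: List[List[str]] = [["dbt", "run"], ["dbt", "test"]]


def run_command(args: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(args)).returncode


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Run the steps in order and stop at the first non-zero exit code."""
    for step in steps:
        if runner(step) != 0:
            return False  # let the scheduler mark the job as failed
    return True
```

A scheduler entry point would call run_pipeline(STEPS) from inside the project directory and turn a False result into a failed job; the runner parameter exists only so the logic can be exercised without dbt installed.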

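Summary

That was my walkthrough of the dbt quickstart tutorial for a manual install of dbt Core. Unlike the dbt Cloud version, every step is done by running commands, so I did feel the bar was somewhat higher: it is more fiddly and calls for some programming background and knowledge. On the other hand, if you can clear that bar, dbt Core is open source, which means you get the power of dbt for free, and I find that very appealing.

Having worked through the dbt Cloud and dbt Core versions, and seeing integration with job scheduling tools come up at the end of the Core version, I also want to sort out for myself the feature differences between Core and Cloud and the criteria for deciding when dbt is better combined with other services. I hope to get there by continuing to use dbt's features extensively and to keep writing about what I learn.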